Hanzi Grid: Toward a Knowledge Infrastructure for Chinese Character-based Cultures
Abstract. The long historical development and broad geographical variation of Chinese characters (Hanzi/Kanji) have made them a cross-cultural information-sharing platform in East Asia. However, due to the lack of a proper research framework, integrating the heterogeneous knowledge grounded in Hanzi and its variants has remained a thorny problem. In this paper, we propose a theoretical framework for the knowledge representation of Hanzi in a cross-cultural context. Our proposal is based mainly on two resources: Hantology and Generative Lexicon Theory. Hantology is a comprehensive Chinese character-based knowledge resource created to provide a solid foundation for both philological surveys and language processing tasks, while Generative Lexicon Theory is extended to capture the rich knowledge carried by Chinese characters within its proposed qualia structure. We believe that the proposed theoretical framework will strongly influence the current research paradigm of Hanzi studies and help to shape an emergent model of intercultural collaboration.
Similar resources
Chinese Word Segmentation as LMR Tagging
In this paper we present Chinese word segmentation algorithms based on the so-called LMR tagging. Our LMR taggers are implemented with the Maximum Entropy Markov Model, and we then use Transformation-Based Learning to combine the results of the two LMR taggers that scan the input in opposite directions. Our system achieves F-scores of and on the Academia Sinica corpus and the Hong Kong City Unive...
Chinese Word Segmentation as Character Tagging
In this paper we report results of a supervised machine-learning approach to Chinese word segmentation. A maximum entropy tagger is trained on manually annotated data to automatically assign to Chinese characters, or hanzi, tags that indicate the position of a hanzi within a word. The tagged output is then converted into segmented text for evaluation. Preliminary results show that this approach...
Chinese Characters Mapping Table of Japanese, Traditional Chinese and Simplified Chinese
Chinese characters are used in both Japanese and Chinese, where they are called Kanji and Hanzi, respectively. Because Chinese characters carry significant semantic information, a mapping table between Kanji and Hanzi can be very useful for many Japanese-Chinese bilingual applications, such as machine translation and cross-lingual information retrieval. Because Kanji characters originated from ancient ...
Unsupervised Word Segmentation Without Dictionary
This prototype system demonstrates a novel method of word segmentation based on corpus statistics. Since the central technique we used is unsupervised training based on a large corpus, we refer to this approach as unsupervised word segmentation. The unsupervised approach is general in scope and can be applied to both Mandarin Chinese and Taiwanese. In this prototype, we illustrate its use in wo...
Radical-level Ideograph Encoder for RNN-based Sentiment Analysis of Chinese and Japanese
The character vocabulary can be very large in non-alphabetic languages such as Chinese and Japanese, which makes neural network models for such languages very large. We explored a model for sentiment classification that takes the embeddings of the radicals of Chinese characters, i.e., the hanzi of Chinese and the kanji of Japanese. Our model is composed of a CNN word feature encoder and a bi-direct...